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Summary. Often the relation between the variables constituting a multivariate 
data space might be characterized by one or more of the terms: "nonlinear", 
"branched", "disconnected", "bended", "curved", "heterogeneous", or, more gen- 
eral, "complex". In these cases, simple principal component analysis (PCA) as a 
tool for dimension reduction can fail badly. Of the many alternative approaches 
proposed so far, local approximations of PCA are among the most promising. This 
paper will give a short review of localized versions of PCA, focusing on local prin- 
cipal curves and local partitioning algorithms. Furthermore we discuss projections 
other than the local principal components. When performing local dimension reduc- 
tion for regression or classification problems it is important to focus not only on the 
manifold structure of the covariates, but also on the response variable(s). Local prin- 
cipal components only achieve the former, whereas localized regression approaches 
concentrate on the latter. Local projection directions derived from the partial least 
squares (PLS) algorithm offer an interesting trade-off between these two objectives. 

We apply these methods to several real data sets. In particular, we consider 
simulated astrophysical data from the future Galactic survey mission Gala. 
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1 Introduction 

Principal component analysis is a well established tool for dimension reduc- 
tion. For data X = {x'l, . . . ,x'„y with Xi G W the principal components 
provide a sequence of best linear approximations to it. Let S be the empirical 
covariance matrix of X , then the principal components are given by the eigen 
decomposition 

s = rAr' (1) 

where A = diag(Ai, . . . , Xp) is a diagonal matrix containing the ordered eigen- 
values of S, with Al > . . . > Ap, and F is an orthogonal matrix. The columns 
of P = (71 , . . . , 7p) are the eigenvectors of S and are called principal compo- 
nent loadings. The first loading 71 maximizes the variance of the scores X7 
over all j gMP with ||7|| = 1, the second loading 72 maximizes the variance 
of X7 over all 7 € M^* with ||7|| = 1 which arc orthogonal to 71, and so on. 

As an example, we consider a speed-fiow diagram recorded on the freeway 
4-W in Contra Costa County, California (figure 1(a)).'* The displayed points 
correspond to the average speed and flow over 5 minute intervals. The data 
were collected on 10th December 2006; this was a Sunday, so the traffic flow 
was quite low, not exceeding 1300 vehicles per hour. Most cars went at a speed 
close to the speed limit. One can easily imagine a first principal component 
line git]) = x + r] ■ (with overall mean x) fitted through the data cloud, 
capturing the main part (here: 99.54%) of the variance in this data set (figure 
1(a)). Clearly, the projection indices rji = (xi — S)^7i of the data projected 
onto this line are good indicators for the positions of the data points within 
the data cloud. 

However, the application of principal component analysis postulates im- 
plicitly some form of linearity. More precisely, one assumes that the data cloud 
is directed, and that the data points can be well approximated by their pro- 
jections onto the line g corresponding to the first principal component (or 
in general the affine hyperplane corresponding to the first d principal compo- 
nents). This implicit assumption of linear PC A can already be violated in very 
simple situations. As an example, consider the speed-flow diagram recorded 
three days earlier with the same detector at the same location. This was a 
Thursday and traffic was busy, leading to occasional congestion. As a con- 
sequence, cars often had to reduce their speed at higher flows, resulting in 
the data cloud shown in figure 1(b). This kind of pattern has frequently been 
reported in the transportation science literature, see e.g. [27]. 

These data are somewhat half-moon shaped and it does not seem to make 
much sense to speak of some principal direction for this data. The projection 
indices onto any straight line like the first principal component (figure 1(b)) 
would be uninformative with respect to the position of the data within this 
bent data cloud. This will certainly get worse if the the data cloud is strongly 



The data are taken from the database PemS [45]. 
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twisted, if it has crossings, if it consists of several branches, if the curvature 
permanently changes, or if there are several disconnected clouds, and so on. 

Solutions to problems of this kind are readily available, and fall into two 
major categories: Nonlinear principal component analysis (hereafter: NLPCA) 
and principal curves (or their multivariate extension, principal manifolds) . The 
dividing lines between these two approaches are often rather fuzzy, but the 
following vague rule can be used to distinguish between them. Principal curves 
is a nonparametric extension of linear PCA, while NLPCA is a parametric (or 
semi-parametric), but nonlinear version of PCA. Most, but not all algorithms 
belonging to either of the categories accomplish their task in two steps: 

Projection Define a dimension reducing transformation f{x) : — > K'' 
Reconstruction Find a mapping back to the data space g{r]) ■.R'^^W 

Here, the vector rj represents a d— dimensional parameterization of the 
p— dimensional data space. Summarizing, the reconstructed curve is given by 
g{f{x)), which is mostly found by (implicitly or explicitly) minimizing the 
reconstruction error 

nx-g{f{x))\\\ (2) 

or its empirical counterpart 

n 

Y,\\^i-9U{xi))f. (3) 

i=l 

PCA can be seen as a special case of the above algorithm. It follows directly 
from the extremal properties of the eigenvectors that the first d principal 
components define the linear, rf-dimensional functions minimizing equations 
(2) or (3). In the special case of PCA f{x) = (71 • • -7^)' {x — /x), girf) = 
(71 • ■•Idin-, and {g o f){x) = /ii + (71 • • • 7d) (71 • ■■fd)' {x - fi), where 
ju = E(x). If all principal components are used (i.e. d = p) then it is easy to 
see that g o / is the identity and the reconstruction error vanishes. 

A further difference between principal curves and NLPCA is that the pro- 
jection / in NLPCA has to be continuous whereas the projection in prin- 
cipal curves can be discontinuous [37]. An important concept for NLPCA 
as well as for principal curves is that of self-consistency, implying that the 
reconstructed curve is the mean over all data with equal projection, i.e 
g{ri) = 'E{x\f{x) = ry). This idea is the cornerstone of the original princi- 
pal curve approach by Hastie and Stuetzle [28] (hereafter: HS) as well as for 
the NLPCA by Bolton et al. [5] , who also consider all possible combinations of 
linear and nonlinear projection and reconstruction. A comparison of NLPCA 
and HS principal curves can be found in [39]. 

A special type of principal curves, a so-called local principal curve ([16], 
see Section 2.2), is shown for the second of the two data sets in figure 1. 

This article is concerned with extensions of PCA based on localization. In 
Section 2 we review such methods and also present a new approach based on 
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Fig. 1. Speed flow diagrams for Freeway 4-W on 10th (a) and 7th (b) of December 
2006, with principal component lines (straight Unes) and a principal curve through 
the latter data set (smooth curve). 



local partitioning. In Section 3 we discuss how principal component regres- 
sion can be localized. Furthermore we discuss projection directions other than 
the principal components. Using the projection directions used in the partial 
least squares (PLS) algorithm allows for taking both the inherent structure of 
the covariates and the response variable into account. In Section 4 we apply 
these methods to simulated data produced in order to develop classification 
and parametrization methods for the upcoming Gaia astronomical survey mis- 
sion. This survey will obtain spectroscopic data (i.e. high dimensional vectors) 
on over one billion (10^) stars, from which wc wish to estimate intrinsic as- 
trophysical parameters (temperature, abundance, surface gravity). 



2 Localized principal component analysis 

The term "local(izcd) PGA" was coined decades ago in the literature on di- 
mension reduction and feature extraction. However, it has not been used in a 
unique way; there are several ways of interpreting the term local(ized). In this 
section, we will give a brief review of types of local PGA and illustrate some of 
them by applying them to the second speed-flow diagram presented in Section 
1. In addition, we consider another artificial data set (corresponding to figure 
3d in [16]), representing a spiral- like data cloud, with which principal curve / 
NLPGA algorithms generally tend to have problems. 
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2.1 Cluster- wise PC A 

Historically the term localized PCA was used for cluster-wise PCA. Instead 
of fitting principal components through the whole data set, cluster-wise PCA 
partitions the data cloud into clusters, within which the principal component 
analysis is carried out "locally" . 

Originally cluster-wise PCA was proposed as a tool for exploratory data 
analysis [24, 6]. It was rediscovered by the neural network community in 
the 1990s in the context of non-linear signal extraction as an alternative to 
auto-associative neural networks [33]. Five- layer auto-associative neural net- 
works were successfully employed for (global) nonlinear PCA [40], using input 
and output layers with p units, and a hidden layer with d < p units. Auto- 
associative neural networks possess many desirable theoretical properties in 
terms of signal approximation [30]. However neural networks typically yield 
a non-convex optimization problem that is burdensome to solve, and the al- 
gorithm is — without suitable regularization — prone to getting trapped in 
poor local optima. Empirical evidence [33] suggests that cluster-wise PCA is 
at least on a par with auto-associative neural networks. 

The cluster-wise PCA algorithm can be seen as a generalization of the 
Generalized Lloyd algorithm ("k- means") [25]. In the k- means algorithm the 
cluster centers are points, whereas in localized PCA, the cluster centers are 
hyperplane segments. The outline of the the cluster-wise PCA algorithm is as 
follows [18]: 

Algorithm 1: Cluster- wise ("local") PCA 

1. Choose a target dimension d {d = 1 e.g. yields a line), a number of clusters 
Q, and an initial partitioning of the input space into Q disjoint regions 

u . . . u ii(^) = w. 

2. Iterate . . . 

i. For each partition R^'^^ {q = 1, . . . ,Q) compute the "local" covariance 
matrices 



in partition R^'^K Obtain the local principal components f i , ■ ■ ■ ,7^^ 
from the eigen decomposition of U'^'^^ (with corresponding eigenvalues 

A(')>...>Ai^) >...). 
ii. Update the partitioning R^^^ , . . . , R^'^^ by allocating each observation 
Xi to the "nearest" partition, i.e. the partition whose hyperplane seg- 
ment is closest to Xi. 



Note that it is important in step 2.ii. to only consider the hyperplane 



segment jaf') + J^"^^! Cj7j''^ : 6, • • • 6i e k} n R'-'^^ instead of the whole hy- 



perplane. Originally the partitioning was fixed a priori and thus there was 
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no need to iterate steps i. and ii. This much simpler setup however is typi- 
cally detrimental to the performance of the algorithm. A variant of algorithm 
1 featuring automatic selection of clusters, and of the number of principal 
components within clusters, was proposed by Liu et al. [36] and used for 
star/galaxy classification. 

It is easy to see that in complete analogy to the k-means algorithm the sum 
of squared distances (3) is never increased by any step of the above algorithm. 
However as with k-mcans or aiito-associativc nciiral networks, there is no 
guarantee that the global optimum is found. The most important drawback 
of this method is that its results are highly dependent on the initial choice of 
the partitioning. 

This gives rise to the idea of slowly "building up" the partitions recursively 
akin to classification and regression trees (CARTs) [7] . Starting with a single 

partition, each partition is recursively "split up" into two partitions if the data 
can locally be better be approximated by two hyperplane segments instead of 
a single one. 

In order to evaluate whether such a split is necessary, every partition i?*^'^ 
is split at the mean of the partition x''-' orthogonally to the first principal 
component 7''^ of the data in partition q, yielding two partitions i?^'^ and 
R^'^\ Next a small number of "k-scgments" steps (steps 2.i. and 2.ii. from 
algorithm 1) are carried out for the partitions i?*^'-* and R^'^\ however only 
using the data initially belonging to the partition R^'^h The split is retained 
if 

A(^)+^^^+Af /n(0 A^.-. + Ai'^ x[^^ + . . . + X^J^ ] 

a(«) + . . . + a(«) ■ [nC'^ ■ A^n . . . + A« ' x(;^ + ... + x^;^ I ' 

(4) 

where n^^^ is the number of observations in partition i?^') , A^'^ the fc-th largest 
eigenvalue of the covariance of the data in partition i?*^^) , and C is a constant 
which is typically chosen to be 1. Note that, contrary to CARTs, only a single 
split is considered for each partition. Considering every possible split of the 
partition would be computationally wasteful and in most cases not necessary 
as the split is subsequently optimized using a few "k-segments" steps. 

In addition to testing whether partitions should be split, one can test 
whether neighboring partitions should be joined. This allows the algorithm 
to get around some of the suboptimal local extrema. In our experience it is 
beneficial to use a rather loose definition of neighborhood. Two partitions 
and are hereby considered to be neighbors if for at least for one 
observation Xi, both i?^') and R^^'> are amongst the s > 2 "closest" partitions. 
If the inequality (4) does not hold for two neighboring partitions i?^'^ and 
these are joined forming a new partition 

This yields the following new algorithm: 

Algorithm 2: Recursive local PCA 

1. Start with a single partition i?^^^ containing all the data. 
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2. Iterate . . . 

i. Test for each partition R^'^^ whether it should be split (using the cri- 
terion (4)). 

ii. Carry out a fixed number of "k segments" steps (steps 2.i. and 2.ii. of 

algorithm 1) updating all partitions. 

iii. Test for each neighboring pair of partitions whether the pair should be 
joined (using the criterion (4)). 

. . . uutil tlicro is uo (•lian,u,o iu the allocation of ol)S(^r\-atioiis to i)artitioiis. 



Note that this algorithm does not require the choice of an initial parti- 
tioning. It can be beneficial to "enforce" a certain number of splits during the 
first few iterations, i.e. initially carry out every possible split irrespective of 
whether it meets criterion (4) and initially skip step 2.ii. 

A similar idea has been proposed by Verbeek et al. [46]. Their method is 
however based on splitting off "zero length" segments corresponding to each 
observation. 

Figure 2 illustrates the algorithm using a simple example. The data cannot 

be summarized by its first principal component (top left panel) . Whilst it can 
be summarized by two line segments, the two line segments found in the second 
step are no truthful representation of the data (top right panel). The four line 
segments found in the third step correspond to the structure behind the data 
(bottom right panel). Finally the adjacent segments are joined leading to two 
partitions only (bottom right panel). 

Note that the algorithm docs not yield a principal curve or manifold, but 
merely disconnected line or hyperplane segments. It can be seen as finding 
tangent approximations to the principal curve or manifold. Thus the algo- 
rithm can by design cope with disconnected or branching principal curves or 
manifolds. Note that in the case of principal curves {d = 1) one can join the 
line segments to build a polygonal line [46]. 

2.2 Principal curves 

Principal curves were firstly introduced by Hastie and Stuetzle [28]. Their 
definition is based on the concept of self- consistency. A curve g is called a 
Hastie-Stuetzle (HS) principal curve if g{r]) = E{x\f]g{x) = rj), with projec- 
tion index r]g{x) := argmin^ \\x — g{rj)\\. Hastie and Stuetzle show that a 
HS principal curve is a critical point of the distance function (2). Hastie and 
Stuetzle's definition however has a number of shortcomings. Whilst the prin- 
cipal curve can be shown to be a critical point of the distance function, one 
can show under fairly general conditions that it is just a saddle point of the 
distance function [14] . Further, when the data x are generated by adding noise 
e with Ee = to a curve g, i.e. x = g{r]) + e, g is typically not the principal 
curve. Kegl et al. [34] overcome the former problem by considering principal 



8 Einbeck et al. 



(1) First principal component (2) Splitting into 2 paritions 




-10 -5 5 10 -10 -5 5 10 



(3) Splitting into 4 paritions (4) Joining adjacent partitions 




-10 -5 5 10 -10 -5 5 10 



Fig. 2. Example illustrating the recursive local PCA algorithm. 

curves of fixed length. Tibshirani's definition of principal curves [43] over- 
comes the latter shortcoming by defining principal curves using a generative 
model. 

The algorithms proposed by Hastie and Stuetzle, Tibshirani, and Kegl can 
all be described as "top-down" (or "global") algorithms. All start with the 
first principal component and then iteratively "bend" it. The algorithm by 
Hastie and Stuetzle can be summarized as follows. 

Algorithm 3: "Top-down" principal curves (HS) 

1. Initialize the principal curve as the first principal component line g{ri) = 

X + 7?7i. 

2. Iterate between . . . 

i. Projection Project the data points Xi onto the principal curve g 
yielding projection indexes r]g{xi) = rji. 
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ii. Reconstruction Fit the data points Xi component-wise against the 

projection indices rji using a scattcrplot smoother. 
. . . until the change of some measure of goodness-of-fit falls below a certain 
threshold. The curve g : M — > V reconstructed in the last iteration is the 
estimated principal curve. 



The reconstruction step requires some form of smoothing, otherwise all 
data points would just be interpolated. It is implemented by using splines or 
local smoothers such as LOESS [10]. Note that this algorithm entails that 
the order of the projections rji is maintained in each iteration. Tibshirani's 
approach yields a similar algorithm. As Tibshirani's definition does not assume 
errors that are orthogonal to the principal curve the observations cannot be 
directly associated to a single point on the curve. Rather the algorithm is 
based on weights ( "responsibilities" ) that correspond to the likelihood that a 
certain observation was generated by a certain point on the curve. From this 
point of view Tibshirani's algorithm is very similar to the EM algorithm for 
Gaussian mixture models. Kegl et al.'s algorithm uses polygonal lines with an 
increasing number of nodes. 

Though all the three algorithms were shown to give satisfying results in 
a variety of circumstances, they all suffer from the problem of being highly 
dependent on the initialization. An unsuitable initialization of the projection 
indices rji cannot be corrected later on. This is particularly obvious for spiral- 
like data, see figure 3 (top). 

Hence, instead of starting with a global initial line, it often seems more 
appropriate to construct the principal curve in a "bottom-up" manner by 
considering in every step only data in a local neighborhood of the currently 
considered point. Delicado [12] proposed the first principal curve approach 
which can be assigned to this family, the principal curves of oriented points 
(PCOP). A second approach in this direction was recently made by Einbeck 
et al. [16] and is known as local principal curves (LPC). These two algorithms 
differ from all other existing NLPCA / principal curve algorithms in several 
crucial points: 

• They do not start with an initial globally constructed line like the first 

principal component. 

• They do not dissect into the stages of projection and reconstruction. The 
entire fitting process is carried out in the data space only. 

• They do not maximize/minimize a global fitting criterion. 

We illustrate this family of methods by providing the LPC algorithm ex- 
plictly. To fix terms, let Kh{-) be a p— dimensional kernel function, H a 
multivariate (pxp) bandwidth matrix, wf = Knixi — x) / Knixi — x) , 
and x[ = (xii, ■ • ■ , Xip), i = l,...,n. 

Algorithm 4: Local principal curves (LPC) 

1. Select a starting point Xq and a step size to. Set x = xq- 
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2. Calculate the local center of mass /x*^ = Xi at x. Denote by /xj 

the j-th component of fi^ . 

3. Estimate the local covariance matrix = (crj^) at x via 

^Jfc = Er=i - ^iJ){x^k - fit)- 

Let 7^ be the first column of the loadings matrix computed locally at 
X in analogy to (1). 

4. Setting x := fi^ + ioT^', one finds the updated value of x. 

5. Repeat steps 2 to 4 until the sequence of ju'* remains approximately con- 
stant (implying that the end of the data cloud is reached). Then set again 
X = xo, set := —7'" and continue with step 4. 

The step size to is recommended to be set equal to hif H = diag(/i, . . . ,h). 

The starting point Xq may or may not be a member of the data cloud, and 
can be selected by hand or at random. Depending on the particular data set, 
it can have a rather crucial influence on the fitted curve. (In the examples 
provided in figure 3 (middle), the spiral fit is quite independent of Xq, but 
it may require some attempts to get the fit through the traffic data right. 
A successful (randomly selected) choice was here e.g. xq = (93,71), using 
h = to = 7.) For branched or disconnected data clouds it is often useful 
to work with multiple initializations [15, 16] and to compose the principal 
curve from the individual parts resulting from each starting point. In order to 
improve stability it is beneficial to replace the first local principal component 
7^^ by the weighted average between the current local principal component 7^ 
and the previous local principal component 7'*°id^ j g ccy'^ + (1 — Q;)7'"°" 
for a suitable a G (0, 1] ("angle penalization", see [16]). 

To make the difference to Delicado's algorithm clear, let ?i(a;, b) be the 
hyperplane which contains x and is orthogonal to a vector b. Then, for given 
X, Delicado defines the principal direction b*(x) as the vector minimizing 
the total variance of all observations lying on this hyperplane, and lJ-*{x) as 
the expectation of all points lying on H{x,b*{x)). The set of fixed points of 
/i* defines the set 'P(-X') of principal oriented points (POPs), and any curve 
a :Rd I — ^RP with {a{s) : s G /} € V{X) is a PCOP. This is a proba- 
bilistic definition, hence estimates are required in practice, and this is where 
localization enters into Delicado's algorithm in two different ways. Firstly, 
one considers a certain neighborhood of the current hyperplane Ti., and data 
nearer to H are associated with higher weights. Secondly, a cluster analysis is 
performed on 7i, and only data in the local cluster are used for averaging. In 
contrast to the LPCs, where the local center of mass is found in a one-step 
approximation, Delicado's analogue to step 2 requires iteration until conver- 
gence. Summarizing the essential difference slightly simplified, whereas LPC 
is based on calculating alternately a local first principal component 7f and 
a local center of mass fi^ , in Delicado's algorithm these two components are 
replaced by the estimated principal direction and the estimated fixed points 
(POPs), respectively. While Dehcado's algorithm is based on an elaborated 
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and sound probabilistic theory, it is quite burdensome from a computational 
point of view, as his principal directions arc not obtainable through an cigcn 
decomposition [16], and a large number of cluster analyses has to be run. 
On the other hand, the computationally faster LPCs are a purely empirical 
concept, but they can be shown to be an approximation of Dclicado's algo- 
rithm for small cluster sizes [16]. For either algorithm, extensions to branched 
principal curves using the notion of local second principal curves have been 
suggested [12, 15] and automatic bandwidth selection routines have been pro- 
posed [13, 16]. 

2.3 Further approaches 

The term local PC A has also been used for other tcchniciues and in other 
contexts, which we briefly summarize here. Firstly, one can extend the algo- 
rithms presented in Section 2.1 by allowing for smoothness over clusters using 
mixture models [38]. This means, rather than allocating an observation Xi to 
the nearest partition q, one considers the posterior probabilities tt^ ,j that Xi is 
generated by partition q, and defines for each observation the weighted (first) 
local principal component ji^i = ^q''^i,qJi^^ ■ This method requires knowl- 
edge of the mixture density itself and produces eigenvectors which are biased 
towards adjacent components, which motivated [38] to provide a modified 
algorithm using Fisher's linear discriminant (FLD) instead of PCA. 

Secondly, there are approaches from the NLPCA family using local meth- 
ods. While most NLPCA algorithms work by applying some nonlinear, but 
parametric, transformation in the projection and/or the reconstruction step 
([47], [8], among others), Bolten et al. [5] allowed for a nonparametric recon- 
struction g. Using in the projection step a nonlinear transformation followed 
by a linear mapping, i.e. f{x) = W0(a;), (f) : W ^ R\W & R''^', the 
reconstructed curve takes the form g{W<p{x)), which is estimated using pro- 
jection pursuit regression [23]. The actual smoothing method employed is the 
so-called supersmoother [21], which is a running line smoother using a lo- 
cal neighborhood of the target point with variable span. We do not consider 
NLPCA approaches further, as the contribution by U. Kruger in this volume 
is explicitly dedicated to them. 

Thirdly, the term "local principal component analysis" was used by Aluja- 
Banet and Nonell- Torrent [2] for PCA using contiguity relations based on a 
non-directed graph expressing binary relations between the individuals. Their 
motivation is to control for the effect of a latent or third variable (e.g. ge- 
ographical position) by virtually eliminating its influence on the principal 
component analysis. This falls somewhat out of the context of the present 
paper and we do not consider it further. 
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Spiral Traffic flow 




-1.0 -0.5 0.0 0.5 50 100 150 



Fig. 3. Principal curves obtained for the spiral data and the non-linear traffic 
flow data using algorithm 3 (top row), algorithm 4 (middle row), and algorithm 2 
(bottom row) 

3 Combining principal curves and regression 

3.1 Principal component regression and its shortcomings 

Using principal components for dimension reduction prior to carrying out 
further analysis is an old paradigm in statistics. The combination of princi- 
pal component analysis and linear regression, known as principal component 
regression (PGR), was proposed already in the 1950s by Kendall [35] and 
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Hotelling [31]. PGR consists of two steps. As a first step the first d principal 
components arc extracted from the covariatcs X . In the second step, the first 
d principal component scores := (ti • • • td) = X ■ (7^^^ • • • -y^'^^^ are used 
as covariates in a linear regression model. Geometrically, this corresponds to 
projecting the covariatcs onto the affinc hyperplanc spanned by the first d 
principal components and using the projection indices as covariates in the 
regression. 

The purpose of the projection step is to eliminate directions in which the 
covariates have little variance. The rationale behind this is that directions with 
little variance correspond to directions in which there is little information in 
the covariates and thus are prone to leading to an overfit. 

Being built upon principal component analysis, PGR suffers from the same 
shortcomings as PGA. If the structure of the covariates cannot be suitably ap- 
proximated by an afhne hyperplanc of low dimension, PGA and thus PGR are 
likely to fail. In the following we will propose a new regression-tree like algo- 
rithm, based on algorithm 2, which can be used for local dimension reduction 
of the covariate space of a regression model. 



3.2 The generalization to principal curves 

Instead of projecting the covariates onto the principal component surface we 
can project them onto a principal curve or manifold. In complete analogy to 
PGR we can then use the projection indices in the regression model. 

This requires a unique parameterization of the principal manifold (e.g. 
a suitable unit speed parameterization). Whilst some of the algorithms pre- 
sented in Section 2 compute the projection indices, other principal curve or 
manifold algorithms like algorithm 4 only yield a set of points on the curve 
or manifold. 

The problem of having to compute the projection indices onto a non-linear 
curve or manifold can be circumvented by using tangent approximations to 
the principal curve or manifold as provided by algorithm 2, because the pro- 
jections are then onto hyperplanc segments and not onto non-linear manifolds. 
The projections onto the hyperplanc segment corresponding to partition R^'^^ 
are 



(9) 



arg mm 



The (t^^', . . . , t-J^'') are easy to compute as they are either the orthogonal 
projection of the covariates onto the hyperplanc or a point on the boundary 
of the hyperplanc segment. 

In analogy to algorithm 2 one can fit a regression model separately in each 
partition. This, however, leads to a couple of disadvantages. First of all, the 
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prediction would be discontinuous at the boundaries between the partitions. 
Furthermore, keeping the "hard partitioning" can lead to a large variance of 
the predictions, especially if the number of partitions is large. In order to 
avoid these detrimental effects the partitioning is "softened" by the intro- 
duction of weights that reflect how far an observation Xi is from a certain 
hyperplane segment. In order to achieve this all observations are projected 
onto all of the hyperplane segments. For each partition R^'^'^ weights Lio'f'^ are 
associated with each observation Xi. The weight uj\'^^ decreases with increas- 
ing distance between the observation and the hyperplane, i.e. uj'f'^ := w{5^j^'^), 



where 5, = 



the «-th observation and its projection onto the hyperplane segment belong- 
ing to the partition R'^''\ and w : Rq — > Rq is a decreasing function, such 
as w{S) — e^'^'^. Figure 4 illustrates this idea. The steps of the proposed 
algorithm are as follows: 



is the squared distance between 



Fig. 4. Local regression lines in the partitions and resulting regression function for 
covariates lying on a circle in the two-dimensional plain. The vertical dimension 
corresponds to the response. Left: "hard partitioning"; right: "softened". 



Algorithm 5: Projection-based regression trees 

1. Carry out algorithm 2 yielding a set of partitions R^^\ . . . , i?'*^' with the 
corresponding hyperplane segments. 

2. Compute the squared distances 5^-''' = 

weights Wj-''' := 

3. For each partition . . . , 

Fit a regression model using the projections of the data T'^^ :— 
(^t'l^ ■ ■ ■ tli^^ as covariates, y as response, and uj^'^^ as observation 
weights. 

4. Compute the predictions as weighted means 



Xi 



X 



(9) 



and 
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where y -'^ is the prediction obtained from the regression model in partition 

Note that a wide variety of regression methods can be used in step 3. If 
the regression method does not support weights one can alternatively sample 
from aU observations using the oj^'^^ as samphng weights. 

Algorithm 5 is a prediction algorithm that is like CARTs based on recur- 
sive partitioning. In contrast to CARTs, which split up the covariate space by 
a series of (orthogonal) "cuts" , the method proposed here partitions the co- 
variate space according to a tessellation defined by the hyperplane segments. 
Figure 5 illustrates this comparison. As algorithm 5 takes the structure of the 
covariates into account, it provides some sort of regularization. Thus it tends 
to have a smaller variance than CARTs, however at the price of a potentially 
increased bias. In a boosting [20] context, this makes the method proposed 
here an interesting candidate for a weak learner. 
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(a) "Classical" CART 



(b) Projection-based classification tree 



Fig. 5. Comparison between a classification tree (CART) and a projection-based 

classification tree 



The algorithm presented here is similar to cluster-wise linear regression 
[42], however there a few major differences. Most importantly, a dimension 
reduction step is carried out in each cluster and the clusters are not con- 
structed around a single point, the cluster center, but around hyperplane 
segments. Furthermore the boundaries between the partitions are softened by 
the exponential weights. Note that the algorithm considered so far does not 
take the response variable y into account when constructing the partitions. 
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3.3 Using directions other than the local principal components 

The rationale behind projecting the data onto the principal component plane 
or a principal manifold was based on equating the variance of the covariates 
to the information the covariates provide. But this docs not necessarily imply 
that the directions with the largest variance necessarily provide the relevant 
information necessary to predict the response y.^ In other words, it might be 
beneficial to consider both the structure of the covariates and the response 
variable. 

We propose to modify algorithm 2 to take both these aspects into account. 
The core step of algorithm 2 were the "k-scgmcnts steps" used already in algo- 
rithm 1. In step 2.i. the principal components are computed in each partition. 
The core property of the principal components is that they maximize the vari- 
ance of the projections Var(X(')7). Clearly, this does not take the response 
y^'^) into account at all.^ 

If we were to choose the direction that predicts the response y^'^ 
best, we would choose the least squares regression estimate 

^^'■^ = (X^^' X^'''>) X^'*'>' y^'*\ which maximizes the squared correlation co- 
efficient {y^'^\ X^'^'>' j3) . Note that the latter criterion does not consider the 
variance of the covariates X^*) at all. 

It seems to be a suitable compromise to choose a projection direction 
7 that maximizes the product of the two aforementioned criteria, which is 
equivalent to maximizing the squared covariance between the X'^'^^^ and y^'\ 
i.e. 

Cov2(y(«),X(«)7) =p2(y(9)^X(«)7) • Var(X('')7) • Var(y(9)). 

This covariance is maximised by the projection direction of the PLS algorithm 
[49, 11]. This alternative projection direction can easily be incorporated into 
algorithms 1 and 2. Step 2.i. of algorithm 1 simply has to be replaced by the 
computation of the PLS projection direction obtained by carrying out a PLS 
regression with the data belonging to partition i?^*) . Using this modification 
typically improves the predictive performance. The following paragraph gives 
an example for this. 

3.4 A simple example 

In the following we will consider a data set taken from [9] to illustrate the 
algorithm proposed in this section. The data consists of measurements of the 
radial velocity of a spiral galaxy taken at 323 points in the area of sky which 
it covers. The different points in the area of the sky are determined by their 
north-south and east- west coordinates. The data is visualized in figure 6. It is 

' See [32] for a more detailed discussion. 

® X^'^^ contains hereby all covariates a;''-' from the partition R^''\ i.e. the Xi £ R^'^\ 
y^"^ is defined analogously. 
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easy to see that the measurements lie within a small number of slots crossing 
the origin. 

The data set was randomly split into a training set of 162, and a test 
set of 161 observations. Table 1 gives the ^2 error (and its standard devi- 
ation) obtained when predicting the radial velocity using different methods 
based on 1000 replications. The methods considered were algorithm 5 once 
with the principal components and once with the PLS projection direction, a 
linear model, generalized additive models (GAM) [29], multivariate adaptive 
regression splines (MARS) [22], and projection pursuit regression (PPR) [23]. 
The results show that the algorithm proposed here clearly outperforms the 
other methods. This is mostly due to the fact that it can successfully exploit 
the structure of the covariates. The PLS based variant of the algorithm gives 
much better results than the principal- component based variant. 




Fig. 6. Visualization of the galgixy data. 



4 Application to the Gaia survey mission 

Gaia is an astrophysics mission of the European Space Agency (ESA) which 
will undertake a detailed survey of over 10^ stars in our Galaxy and extra- 
galactic objects. An important part of the scientific analysis of these data is 
the classification of all the objects as well as the estimation of stellar astro- 
physical parameters. Below we give a brief overview of the mission and the 
data before describing the results obtained using the techniques presented in 
the preceding section. 
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Table 1. Average L2 error (and standard deviation of the error) obtained for the 
galaxy example for the different methods compared. 

Training set Test set 
Using PC directions 1642.00 (578.4) 1758.44 (615.5) 
Using PLS directions 511.42 (79.3) 577.05 (101.9) 
Linear model 7420.45 (352.9) 7591.19 (377.5) 

MARS 2965.64 (334.7) 3738.76 (494.7) 

GAM 3027.14 (321.6) 3554.26 (385.5) 

PPR 2207.73 (622.0) 3317.94 (820.7) 



4.1 The astrophysical data 

After launch in 2011, Gala will study the composition and origin of our Galaxy 
by determining the properties of stars in different populations across our entire 
Galaxy [17, 41]. One of its major contributions will be to measure stellar 
distances to much higher acciiracy than has hitherto been possible (and will 
do it for a vast numbers of stars). Gaia will also measure the three-dimensional 
space motions of stars in exquisite detail. These will be used together in 
dynamical models to map out the distribution of matter, and can be used to 
answer fundamental questions concerning galaxy formation.^ 

Much of the Gaia astrometric (3D position, 3D velocity) data would be of 
little value if we did not know the intrinsic properties of the stars observed, 
quantities such as the temperature, mass, chemical compositions, radius etc. 
(collectively referred to as Astrophysical Parameters, or APs; see [3]). For this 
reason, Gaia is equipped with a low resolution spectrograph to sample the 
spectral energy distribution at 96 points across the optical and near-infrared 
wavelength range (330-1000 nm). The measurements themselves are photon 
counts (energy flux). Each object can therefore be represented as a point in a 
96-dimensional data space. For those objects which are stars, the astrophysical 
parameters of most interest are the following four: (1) effective temperature, 
which roughly corresponds to the temperature of the observable part of the 
stellar atmosphere; (2) the surface gravity, which is the acceleration due to 
gravity at the surface of the star; (3) the metallicity or abundance, a single 
measurement of the chemical composition of the star relative to that of the 
Sun; (4) the interstellar extinction, which measures how much of the star's 
light has been absorbed or scattered by interstellar dust lying between us 
and the star. In practice there is additional "cosmic variance" due to other 
APs, but these four are the main ones of interest. (In the rest of this paper 
we examine a simpler case in which there is no variance due to interstellar 
extinction.) 

After a century of progress in astrophysics, we now have sophisticated 
models of stellar structure and from these we can generate synthetic spectra 



For more information see littp://www.rssd.esa.int/Gaia 
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which reproduce real stellar spectra reasonably well. Therefore, we can con- 
struct libraries of template spectra with known APs and use these to train 
supervised regression models in order to estimate stellar APs. This has re- 
ceived quite a lot of attention in the astronomical literature, with methods 
based on nearest neighbors (e.g. [1]), neural networks (e.g. [4], [48]) and sup- 
port vector machines (e.g. [44]) 

In its simplest form, the problem of estimating APs with Gaia is one of 
finding the optimal mapping between the 96-dimensional data space and the 3 
or more dimensional AP space. In theory this can be solved directly with reg- 
ularized nonlinear regression, with the mapping inferred from simulated data. 
In practice, however, it is more complicated. For example, the mapping is not 
guaranteed to be unique, so some kind of partitioning of the data space may 
be appropriate. Furthermore, the intrinsic dimensionality of the data space is 
much lower than 96, so we could probably benefit from dimensionality reduc- 
tion. Standard PGA has been applied to such data (e.g. [4], [19]). It produces 
more robust models, but possibly at the cost of filtering out low amplitude 
features which arc nonetheless relevant for determining the "weaker" APs (see 
below). Here we investigate local reduction techniques as an alternative. 

4.2 Principal- manifold based approach 

In this section we study how the methods discussed in Sections 2 and 3 can 

be applied to Gaia spectral data. The data comprise several thousand spec- 
tra showing variance in the three astrophysical parameters temperature (in 
Kelvin), metallicity and gravity; the latter two variables are on a logarithmic 
scale.* Temperatmc is a "strong" parameter, meaning it accounts for most 
of the variance across the data set. Gravity and metallicity, in contrast, are 
weak. The parameters have a correlated impact on the data, e.g. at high tem- 
peratures, varying the metallicity has a much smaller impact on spectra than 
it does at low temperatures. The data used here are simulated with no noise 
added. 

The plot of the first three principal components of the covariatcs, the pho- 
ton counts, shows clearly that these possess some low-dimensional structure 
(figure 7), which cannot be linearly approximated. This suggests employing 
principal- manifolds based methods. 

The low-dimensional structure of the photon counts can be exploited by 
the projection-based regression tree algorithm (algorithm 5). Recall that the 
algorithm is based on the idea that the response is mainly determined by 
the projection of the covariates onto the (tangent to the) principal manifold. 
This however does not need to be the case; it might well be that the relevant 
information is not captured by the principal manifold. 

* In the present example the spectra actually have a dimensionality of just 16, 
rather than 96 as mentioned above, because when we carried out this work the 
Gaia instruments were still being developed. Nonetheless, the results we present 
are illustrative of problems typical in observational astronomy. 
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Fig. 7. First three principal components of the photon counts 

In the following wc will compare whether exploiting the manifold struc- 
ture of the data allows for obtaining better predictions of the APs. Given 
the highly nonlinear structure of the data we use support vector regression 
machines (with a Gaussian kernel) in each partition.^ The partitions are de- 
termined by the PLS-based version of algorithm 5 using 6-dimensional (grav- 
ity, metallicity) and 4-dimensional (temperature) hyperplane segments. The 
results obtained are compared to two further support vector regression ma- 
chines: one using the original data space of 16 photon counts directly, and 
one using the first 6 (gravity, metallicity) / 4 (temperature) global principal 
components as covariates. 

For this purpose, a training set of 3,000 observations, a calibration set of 
1,000 observations, and a test set of 1,000 observations is repeatedly drawn 
from the simulated Gaia data (100 replications). Table 2 gives the results. As 
support vector regression machines minimize the Li distance, we use the Li 
error J^t \lJi - J/il/"- The relative error reported is \yi ~ vMHi \Vi ^ vl- 

The results obtained for the temperature and the metallicity show that 
using the manifold structure of the photon counts allows for a significant 
improvement of the predictive performance. The hyperplanes extracted in 
algorithm 5 seem to capture the information that is relevant to predicting the 

^ The optimal cost parameter and the optimal kernel width are determined for each 
partition individually using a calibration set of 1,000 observations. 
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Table 2. Average Li error and relative error (and standard deviation of the error) 
obtained for the Gaia data. 

(a) Results obtained for the temperature 

Training set Test set 

Li error Relative err. Li error Relative err. 
Algorithm 5 252.9 (75.1) 0.042 (0.012) 258.9 (75.1) 0.043 (0.013) 
SVR (all counts) 408.6 (11.6) 0.068 (0.002) 432.1 (16.0) 0.074 (0.004) 
SVR (RComps.) 404.3 (19.6) 0.067 (0.003) 412.8 (21.8) 0.070 (0.004) 



(b) Results obtained for the gravity 

Training set Test set 

Li error Relative err. Li error Relative err. 
Algorithm 5 0.083 (0.003) 0.069 (0.002) 0.109 (0.008) 0.090 (0.005) 
SVR (all counts) 0.080 (0.001) 0.067 (0.001) 0.104 (0.005) 0.087 (0.005) 
SVR (RComps.) 0.091 (0.002) 0.076 (0.001) 0.146 (0.009) 0.118 (0.007) 



((•) Results ()l)taiii('(l for tlit^ mt^tallicity 

Training set Test set 

Li error Relative err. Li error Relative err. 
Algorithm 5 0.193 (0.018) 0.134 (0.012) 0.269 (0.016) 0.189 (0.011) 
SVR (all counts) 0.279 (0.005) 0.193 (0.003) 0.363 (0.013) 0.253 (0.008) 
SVR (RComps.) 0.256 (0.007) 0.177 (0.004) 0.389 (0.014) 0.269 (0.011) 



temperature and the metallicity whilst some of the noise is discarded in the 
projection step, which facilitates the prediction. This explains why algorithm 5 
outperforms the support vector regression machine using all photon counts. 
The principal components are less able to capture the relevant information; 
the performance of the support vector regression machine using the first 6 (or 
4, respectively) global principal components is clearly worse. 

The results obtained for the gravity, the weakest of the APs, however give 
a different picture. Using the manifold structure does not allow for improved 
predictions of the gravity. The information relevant for predicting the gravity 
seems to be "orthogonal" to the extracted hyperplane segments. Thus the 
support vector regression machine using all the photon counts performs better 
than the ones based on lower-dimensional projections. 
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5 Conclusion 

We have reviewed several approaches and algorithms for the representation of 
high-dimensional complex data structures through lower-dimensional curves 
or manifolds, which share the property of being based — by some means or 
other — on carrying out principal component analysis "locally". We have 
focused on localized versions of principal curves (algorithm 4), and on an 
"intelligent" partitioning algorithm (algorithm 2) avoiding the necessity to 
specify an initial partitioning as with usual cluster- wise ("local") PC A. 

We have demonstrated, using the latter of the two algorithms, how local- 
ized principal components can be used to reduce the dimension of the predictor 
space in a non-linear high-dimensional regression problem. The information 
relevant to predicting the response variable can be elegantly taken into ac- 
count by maximizing the covariance between the response variable and the 
extracted projection directions akin to partial least squares (PLS). We have 
applied this technique successfully to photon count data from the Gaia mis- 
sion, where wc were able to improve the predictive performance for some of 
the response variables significantly. The framework presented here however 
does not guarantee an improved prediction, especially if the information rele- 
vant to the prediction problem at hand cannot be captured by the extracted 
low-dimensional structure — as it was the case for the gravity. 

It would be desirable to be able to also apply local principal curves (algo- 
rithm 4) to such problems. In contrast to algorithm 2, LPCs have the advan- 
tage of representing the covariate space through a proper curve (or manifold) 
instead of disconnected line (or hyperplane) segments. Currently, algorithm 4 
can only extract one-dimensional curves, which it approximates by a sequence 
of points. Generalizing this idea to higher dimensions, one can approximate 
a rf-dimensional principal manifold by a d-dimensional mesh, very much like 
the elastic net algorithm proposed by Gorban and Zinovyev [26] . The (basic) 
elastic net algorithm however is a "top-down" method that iteratively bends 
a mesh of points, starting with a given topology. The generalization of algo- 
rithm 4 would use a "bottom-up" approach, i.e. learn the local topology from 
the data requiring no initialization. 

However, generalizing algorithm 4 to rf-dimensional manifolds poses a num- 
ber of challenges. Firstly, the angle penalization needs to be modified that it 
can be applied to local principal components of higher order. Further one 
has to make sure that the different branches meet, forming a proper mesh of 
points, which will require keeping the distance between subsequent /i*" con- 
stant. 
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